Write a short description about the course and add a link to your github repository here. This is an R markdown (.Rmd) file so you can use R markdown syntax. See the ‘Useful links’ page in the mooc area (chapter 1) for instructions.
The course seems really interesting. It’s cool that we have an enthusiastic teacher. I wonder if many researchers in medicine use R markdown?
Took my first steps with R Markdown and GitHub today. I haven't managed to change the theme of GitHub Pages yet; this is top priority, so I'm working on it.
Setup: setting the working directory and opening the data
## change the working directory of this R Markdown file
## note: root.dir is an opts_knit option, not an opts_chunk option
knitr::opts_knit$set(root.dir = "/Users/pecsi_max/Desktop/GitHub/IODS-project/data/")
setwd("/Users/pecsi_max/Desktop/GitHub/IODS-project/data/")
getwd()
#load the required packages
library(ggplot2)
library(GGally)
library(dplyr)
library(tidyr)    # gather()
library(purrr)    # map_lgl()
library(tableone) # CreateTableOne()
library(corrplot) # corrplot(), cor.mtest()
When opening the data, "tbl_df" converts the data frame to a "data table" format. To my knowledge, this is recommended when using dplyr, although I believe that in these exercises it makes little difference.
#read the data.
learn <- tbl_df(read.csv("/Users/pecsi_max/Desktop/GitHub/IODS-project/data/learning2014.csv"))
#structure of data
str(learn)
## Classes 'tbl_df', 'tbl' and 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
#dimensions of data
dim(learn)
## [1] 166 7
"learn" is a dataset of 166 students, with the variables "gender", "age", "attitude", "points", "deep", "stra" and "surf". The last three are scales measuring different approaches to learning (deep, strategic and surface learning). Students who scored 0 points on the exam are not included.
#create a plot matrix with all the variables
pairs_plot <- ggpairs(learn, mapping = aes(alpha = 0.5),
lower = list(combo = wrap("facethist", bins = 20),
continuous = wrap("points", size=0.1)),
upper = list(continuous = wrap("cor", alpha = 1)))
pairs_plot
#construct a model, choosing 3 explanatory variables: attitude, surface learning, strategic learning
model1 <- lm(points ~ attitude + stra + surf, data = learn)
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The model shows a statistically significant linear association (adjusted R-squared = 0.193). Only attitude appears to be statistically significantly associated with exam points. Let’s change the model…
# try putting all the variables in the model
model_all <- lm(points ~ attitude + gender + age + deep + stra + surf, data = learn)
summary(model_all)
##
## Call:
## lm(formula = points ~ attitude + gender + age + deep + stra +
## surf, data = learn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4506 -3.2479 0.3879 3.5243 11.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.30246 5.25723 3.481 0.000644 ***
## attitude 0.34430 0.05978 5.760 4.24e-08 ***
## genderM -0.05858 0.92985 -0.063 0.949845
## age -0.09607 0.05396 -1.780 0.076929 .
## deep -1.04279 0.78438 -1.329 0.185606
## stra 0.95871 0.55150 1.738 0.084083 .
## surf -1.10891 0.84467 -1.313 0.191132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.265 on 159 degrees of freedom
## Multiple R-squared: 0.2312, Adjusted R-squared: 0.2021
## F-statistic: 7.967 on 6 and 159 DF, p-value: 1.599e-07
Age and stra are perhaps the best predictors in addition to attitude, even though their p-values exceed 0.05.
Let’s create a model with the three best explanatory variables.
model_final <- lm(points ~ attitude + age + stra, data = learn)
summary(model_final)
##
## Call:
## lm(formula = points ~ attitude + age + stra, data = learn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 0.34808 0.05622 6.191 4.72e-09 ***
## age -0.08822 0.05302 -1.664 0.0981 .
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
Adjusted R-squared = 0.2037, marginally better than in model 1. An adjusted R-squared of 0.2037 means that the model explains 20.4% of the variance observed in exam points. "Adjusted" means that the value is penalized for the number of variables in the model.
Only attitude is statistically significantly associated with points: an increase of 1 in attitude is associated with 0.34 more exam points. A one-year increase in age is associated with 0.09 fewer exam points, but the relationship is not statistically significant. Strategic learning is "borderline significant", perhaps associated with slightly better points. However, it should be stressed that attitude is the only variable clearly associated with exam scores.
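The adjusted R-squared can be reproduced by hand from the multiple R-squared; a minimal sketch using the figures reported in the summary above:

```r
# adjusted R^2 penalizes R^2 for the number of predictors:
# adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
r2 <- 0.2182  # multiple R-squared of model_final
n  <- 166     # number of students
p  <- 3       # predictors: attitude, age, stra
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.2037, matching the summary output
```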
plot(model_final, which = c(1, 2, 5))
#create the path to the directory where the data is located
path <- paste(getwd(), "data", "alcohol", sep = "/")
#read the data
alc <- read.csv(path)
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
glimpse(alc)
## Observations: 382
## Variables: 36
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
‘alc’ is a dataset of students in two Portuguese schools, describing their performance in two subjects, mathematics and Portuguese language. It contains variables on the students’ education, family, school performance, and alcohol use, which is of special interest in this report.
For exploration, let’s create two datasets, alc_col and alc_other: alc_col contains the factor variables, alc_other the supposedly continuous ones. As the "failures" variable seems to have only four levels, let’s add it to alc_col as well.
#make indices of which variables are factors and which are not
alc_colinds <- which(map_lgl(alc, is.factor))
alc_col <- dplyr::select(alc, alc_colinds, failures, high_use)
alc_other <- dplyr::select(alc, -alc_colinds, -failures, high_use)
#plot explorative plots
alc_col %>% gather(key, value, -high_use) %>%
ggplot(., aes(x=value, fill = high_use)) + geom_bar() +
facet_wrap("key", scales = "free", shrink = TRUE)
## Warning: attributes are not identical across measure variables; they will
## be dropped
alc_other %>% gather(key, value, -high_use) %>%
ggplot(., aes(x=high_use, y=value)) + geom_boxplot() +
facet_wrap("key", scales = "free", shrink = T)
We select four interesting variables for analysis: famrel, sex, absences and goout.
Let’s do some exploratory analyses with these variables, looking at possible differences between the high and low alcohol consumption groups.
# create a dataset with the variables of interest
alc_log <- alc %>%
dplyr::select(famrel, sex, absences, goout, high_use)
#exploratory table
CreateTableOne(data = alc_log, strata = "high_use", vars = c("famrel", "sex", "absences", "goout"))
## Stratified by high_use
## FALSE TRUE p test
## n 268 114
## famrel (mean (sd)) 4.00 (0.89) 3.78 (0.93) 0.027
## sex = M (%) 112 (41.8) 72 (63.2) <0.001
## absences (mean (sd)) 3.71 (4.45) 6.37 (6.99) <0.001
## goout (mean (sd)) 2.85 (1.02) 3.72 (1.10) <0.001
For all our variables there appears to be a statistically significant difference between the groups: either the proportion (sex) or the mean of the continuous variable differs between heavy and non-heavy drinkers.
modbin <- glm(data = alc_log, high_use ~ famrel + sex + absences + goout, family = "binomial")
#look at the model
summary(modbin)
##
## Call:
## glm(formula = high_use ~ famrel + sex + absences + goout, family = "binomial",
## data = alc_log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7151 -0.7820 -0.5137 0.7537 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76826 0.66170 -4.184 2.87e-05 ***
## famrel -0.39378 0.14035 -2.806 0.005020 **
## sexM 1.01234 0.25895 3.909 9.25e-05 ***
## absences 0.08168 0.02200 3.713 0.000205 ***
## goout 0.76761 0.12316 6.232 4.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 379.81 on 377 degrees of freedom
## AIC: 389.81
##
## Number of Fisher Scoring iterations: 4
#calculate CI's
modbin_odds <- exp(cbind(OR = coef(modbin), confint(modbin)))
## Waiting for profiling to be done...
modbin_odds
## OR 2.5 % 97.5 %
## (Intercept) 0.06277092 0.01648406 0.2225470
## famrel 0.67449969 0.51060362 0.8869199
## sexM 2.75203774 1.66768032 4.6128191
## absences 1.08510512 1.04012840 1.1353940
## goout 2.15461275 1.70345866 2.7639914
All the variables are significantly associated with high alcohol use. Family relations are inversely related: the better the family relations, the lower the odds of heavy drinking. Being male raises the odds ratio to 2.8. Absences and going out are associated with higher odds of high alcohol use.
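As a sanity check on the interpretation, the odds ratio for males is simply exp() of the sexM coefficient, and a fitted probability follows from the linear predictor. A small sketch with the coefficients from the summary above; the example student's values are made up for illustration:

```r
# coefficients from the modbin summary above (log-odds scale)
b0 <- -2.76826; b_famrel <- -0.39378; b_sexM <- 1.01234
b_absences <- 0.08168; b_goout <- 0.76761

exp(b_sexM)  # ~2.75, the odds ratio for males reported above

# hypothetical student: male, famrel = 4, absences = 4, goout = 4
lp <- b0 + b_famrel * 4 + b_sexM + b_absences * 4 + b_goout * 4
1 / (1 + exp(-lp))  # predicted probability of high use, ~0.52
```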
#let's add the predicted probabilities to the data frame
# JUHO'S TIPS: the threshold would go here; by changing it you can adjust sensitivity and specificity
#create a dataset where we'll predict the values
alc_pred <- alc_log
alc_pred$prob <- predict(modbin, type = "response")
First, we’ll find the optimal cut-off value for the predicted probability of heavy drinking (based on the model) above which a student is classified as a heavy drinker. We plot the proportion of incorrect classifications against the cut-off value.
sapply(1:99, function(x) {
  predicted <- alc_pred$prob > (x / 100)
  # proportion of incorrect classifications at cut-off x/100
  mean(predicted != alc_pred$high_use)
}) %>% plot
alc_pred$pred <- (alc_pred$prob > 0.46)
#table
obs_table <- table(high_use = alc_pred$high_use, prediction = alc_pred$pred)
prob_table <- obs_table %>% prop.table %>% addmargins
obs_table
## prediction
## high_use FALSE TRUE
## FALSE 247 21
## TRUE 55 59
prob_table
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64659686 0.05497382 0.70157068
## TRUE 0.14397906 0.15445026 0.29842932
## Sum 0.79057592 0.20942408 1.00000000
In the table with absolute numbers of observations, the number of "true positives" is 59 and "false positives" 21; "true negatives" are 247 and "false negatives" 55. In prob_table, 64.7% were correctly classified as non-heavy drinkers, 14.4% were wrongly classified as non-heavy drinkers, 5.5% were wrongly classified as heavy drinkers, and 15.4% were correctly assigned heavy-drinker status.
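From the same 2x2 table, the usual classification metrics can be computed directly; a small sketch using the counts above:

```r
# counts from obs_table above
tn <- 247; fp <- 21; fn <- 55; tp <- 59
(tp + tn) / (tp + tn + fp + fn)  # accuracy: ~0.80
tp / (tp + fn)                   # sensitivity (heavy drinkers caught): ~0.52
tn / (tn + fp)                   # specificity: ~0.92
```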
#visualise the relation
ggplot(alc_pred, aes(x = prob, y = high_use, col = pred)) + geom_point()
We see that among the heavy drinkers, the number of false predictions is rather high.
#define the loss function; the -0.04 shifts the effective classification threshold from 0.5 to 0.46
loss_func <- function(class = alc_pred$high_use, prob) {
  n_wrong <- abs(class - prob - 0.04) > 0.5
  mean(n_wrong)
}
a <- alc_pred$pred == alc_pred$high_use
1-mean(a)
## [1] 0.1989529
# apply the loss function to different prediction strategies
# probability of high drinking is 0 in all
loss_func(prob=0)
## [1] 0.2984293
# probability of high drinking is 1 in all
loss_func(prob=1)
## [1] 0.7015707
loss_func(prob = alc_pred$prob)
## [1] 0.1989529
If we simply guess that all students are non-heavy drinkers, the prediction error is about 30%. If we guess that everyone is a heavy drinker, the error is 70%. With our model, the training error is "only" 20%.
library(boot)
## Warning: package 'boot' was built under R version 3.2.5
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
cv <- cv.glm(data = alc_pred, cost = loss_func, glmfit = modbin, K = 10)
cv$delta[1]
## [1] 0.2120419
The cross-validation analysis is done with the function ‘cv.glm’. K is the number of folds we split the data into. The first element of cv$delta is the prediction error (the second component is a bias-adjusted value). So, with 10-fold cross-validation, our model has a 21.2% prediction error, which is better than the approx. 26% error in the DataCamp exercise!
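To make the mechanics concrete, K-fold cross-validation can be hand-rolled. A purely illustrative sketch on the built-in mtcars data (doing conceptually what cv.glm does for modbin above; the model formula here is made up for the example):

```r
set.seed(1)
k <- 10
# assign each row of mtcars randomly to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
errs <- sapply(1:k, function(i) {
  # fit on all folds except i, then predict fold i
  fit <- glm(am ~ mpg + wt, data = mtcars[folds != i, ], family = "binomial")
  p <- predict(fit, newdata = mtcars[folds == i, ], type = "response")
  mean((p > 0.5) != (mtcars$am[folds == i] == 1))
})
mean(errs)  # average out-of-fold misclassification rate
```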
library(boot)
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
#2nd model
modbin2 <- glm(data=alc, high_use ~ sex + school + age + address + famsize + Pstatus + reason + internet + failures + schoolsup + activities + romantic + freetime + goout + absences + G3, family = "binomial")
prob2 <- predict(modbin2, type = "response")
train2 <- loss_func(prob=prob2)
cv2 <- cv.glm(data=alc, cost = loss_func, glmfit = modbin2, K = 10)
#3rd model
modbin3 <- glm(data=alc, high_use ~ sex + school + age + failures + schoolsup +
activities + romantic + freetime + goout + absences + G3, family = "binomial")
prob3 <- predict(modbin3, type = "response")
train3 <- loss_func(prob=prob3)
cv3 <- cv.glm(data=alc, cost = loss_func, glmfit = modbin3, K = 10)
#4th model
modbin4 <- glm(data=alc, high_use ~ sex + school + schoolsup + reason + activities + goout + freetime + absences, family = "binomial")
prob4 <- predict(modbin4, type = "response")
train4 <- loss_func(prob=prob4)
cv4 <- cv.glm(data=alc, cost = loss_func, glmfit = modbin4, K = 10)
#5th model
modbin5 <- glm(data=alc, high_use ~ sex + activities + goout + absences, family = "binomial")
prob5 <- predict(modbin5, type = "response")
train5 <- loss_func(prob=prob5)
cv5 <- cv.glm(data=alc, cost = loss_func, glmfit = modbin5, K = 10)
#now create a vector of testing errors and training errors
cv_vector <- c(cv2$delta[1], cv3$delta[1], cv4$delta[1], cv5$delta[1])
train_vector <- c(train2, train3, train4, train5)
length_vector <- c(16, 11, 8, 4)
plot(cv_vector, length_vector)
plot(train_vector, length_vector)
The testing error (calculated with 10-fold cross-validation) appears to increase as the number of variables goes up. The training error does not necessarily follow the same pattern: with 4 predictors, the training error appears to be greater.
Thanks for reading!
We use the ‘Boston’ dataset from the MASS package. It contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts (see ?Boston in the MASS package for details on the variables).
#load the data & first glimpses
library(MASS)
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
The data contains 506 rows and 14 columns, i.e. variables. All variables are numeric (type ‘dbl’) except ‘chas’ and ‘rad’, which are ‘int’ and behave more like categorical variables.
#Visualize the histograms
Boston %>% gather(key, value) %>%
ggplot(., aes(x=value)) + geom_histogram(bins = 30) +
facet_wrap("key", scales = "free", shrink = T)
#visualize the correlations. First, calculate the p values for correlations
p.mat <- cor.mtest(Boston, method = "spearman", exact = FALSE)
#then, calculate and plot correlations. The function leaves out insignificant correlations
cor(Boston, method = "spearman") %>% corrplot(method = "circle",
type = "upper",
tl.cex = 1.3, cl.pos = "b",
p.mat = p.mat$p, sig.level = 0.05, insig = "blank",
diag = FALSE)
For many variables, the distribution is rather skewed, or a single value is heavily represented. "chas" indeed is a categorical variable with levels 0 and 1.
Concerning correlations, many variables appear to be correlated. For example, crime appears to be most strongly (inversely) correlated with the distance to employment centers. Note that the circles represent Spearman’s correlation coefficients. "chas" shows hardly any significant correlations, as it is a binary variable.
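As a reminder of what Spearman's coefficient is: it equals Pearson's correlation computed on the ranks of the values. A quick check on two Boston variables:

```r
library(MASS)
data("Boston")
# Spearman's rho = Pearson's correlation of the rank-transformed values
rho <- cor(Boston$crim, Boston$dis, method = "spearman")
all.equal(rho, cor(rank(Boston$crim), rank(Boston$dis)))  # TRUE
```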
Let’s edit the data so that we can do some clustering and classification.
#scale Boston and turn it into a data frame: each variable gets mean 0, and values are expressed as SDs from the mean
scaled_b <- Boston %>% scale %>% as.data.frame
summary(scaled_b)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
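As a check on what scale() did above, the same standardization can be written out by hand for a single variable:

```r
library(MASS)
data("Boston")
# scale() subtracts the column mean and divides by the column SD
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(scale(Boston$crim)), manual)  # TRUE
```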
#create a factor of crime rate. With "mutate(crim = crim...)" we override the old values of 'crim' and we don't need to drop any old variables
bins <- quantile(scaled_b$crim)
scaled_b <- scaled_b %>% mutate(crim = cut(scaled_b$crim,
breaks = bins,
include.lowest = TRUE,
label = c("low", "med_low", "med_high", "high")))
# Divide the dataset to 'train' and 'test', 80% and 20%
n <- nrow(scaled_b)
ind <- sample(n, size = 0.8*n)
train <- scaled_b[ind,]
test <- scaled_b[-ind,]
lda_boss <- lda(crim ~ ., data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#numeric classes of crime rates
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda_boss, dimen = 2, col = classes, pch = classes)
lda.arrows(lda_boss, myscale = 2)
#test predicted and observed classes
observed <- test$crim
predicted <- predict(lda_boss, newdata = test)
crosstab <- table(correct = observed, predicted = predicted$class)
crosstab
## predicted
## correct low med_low med_high high
## low 15 8 1 0
## med_low 7 15 3 0
## med_high 0 9 14 2
## high 0 0 1 27
"High" crime rates are predicted almost perfectly. In the other classes, around 50% of predictions are correct, but most wrong predictions fall in a class adjacent to the correct one (e.g. correct class med_low, predicted class med_high).
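The overall accuracy follows from the diagonal of the cross-tabulation; a small sketch with the counts from this particular random split (the exact numbers change with the split):

```r
# counts from the crosstab above: rows = correct class, cols = predicted
ct <- matrix(c(15,  7,  0,  0,
                8, 15,  9,  0,
                1,  3, 14,  2,
                0,  0,  1, 27), nrow = 4)
sum(diag(ct)) / sum(ct)  # overall accuracy, ~0.70
```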
Let’s reload the data for clustering.
#reload & scale Boston
data('Boston')
boston_scaled <- Boston %>% scale %>% as.data.frame
#euclidean distances of Boston
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
#how many centroids should we have?
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters could be 2, because the steepest drop in the total within-cluster sum of squares ends there. Let’s nevertheless select 3: the sum of squared distances still drops nicely after two clusters.
# run the K-means cluster analysis again with 3 centroids and plot the results
kmc <- kmeans(boston_scaled, centers = 3)
Boston$cluster <- as.factor(kmc$cluster)
pairs(Boston[5:13], col = Boston$cluster, pch = 19, cex = 0.2)
Boston %>% gather(key, value, -cluster) %>%
ggplot(., aes(x = key, y = value, group = cluster, col = cluster)) + geom_boxplot() +
facet_wrap("key", scales = "free", shrink = TRUE)
The first figure shows the variable pairs visualized with scatter plots. The biggest differences between the 3 clusters are in the variables age, crime rate (crim), distance to employment centers (dis), proportion of lower-status residents (lstat), value of houses (medv), nitrogen oxide concentration (nox), pupil-teacher ratio (ptratio) and so on.
set.seed(123)
# Create another K-means clustering with 5 centroids.
data("Boston")
boston_scaled <- Boston %>% scale %>% as.data.frame
kmc_bonus <- kmeans(boston_scaled, centers = 5)
boston_scaled$cluster <- kmc_bonus$cluster
lda_bonus <- lda(cluster ~ ., data = boston_scaled)
# visualisation, using the same functions as in previous sections; "col" and "pch" are set to boston_scaled$cluster
plot(lda_bonus, dimen = 2, col = boston_scaled$cluster, pch = boston_scaled$cluster)
lda.arrows(lda_bonus, myscale = 2.5)
A couple of variables stand out in these clusters: black and crim seem to contribute most as individual variables. It should be noted that the results varied considerably between runs before the set.seed(123) line was added.
library(plotly)
#copy-pasting the code
model_predictors <- dplyr::select(train, -crim)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda_boss$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_boss$scaling
matrix_product <- as.data.frame(matrix_product)
# k-means clustering for colors
kmc_superbonus <- kmeans(model_predictors, centers = 5)
model_predictors$cluster <- kmc_superbonus$cluster
# 3D plots
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = train$crim)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = model_predictors$cluster)
Comments on preliminary associations
Different aspects of learning and points